-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally return compression statistics from ORC and Parquet writers #13294
Optionally return compression statistics from ORC and Parquet writers #13294
Conversation
…fea-compression-stats
…fea-compression-stats
…into fea-compression-stats
…fea-compression-stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes LGTM
cpp/include/cudf/io/orc.hpp
Outdated
* | ||
* @param comp_stats Pointer to compression statistics to be updated after writing | ||
*/ | ||
void set_compression_statistics(std::shared_ptr<writer_compression_statistics> const& comp_stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Please remind me what the right call is here for parameters: passing by value vs const reference. The methods above pass by value and std::move()
. We're paying for 1 copy anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got tripped up by the shared_ptr, but, yes, we should be able to pass by value + move here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that the statistics is generated internally by the writer, isn't it? If so, why do we need this function, and also why do we need to have the shared_ptr
of that object in the input options to writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The statistics object is created by the user (see tests). That's why we have this shared lifetime management. Basically users registers an object for the writer to write the statistics into.
FWIW, this also means that a statistics object is reusable :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super. I just have a couple of nitpicks, and some clarifying questions.
*/ | ||
[[nodiscard]] auto compression_ratio() const noexcept | ||
{ | ||
return static_cast<double>(num_compressed_bytes()) / _num_compressed_output_bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling for div by zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We let C++ handle it and return a nan
. I added a note about this behavior in the docs of compression_ratio
.
The alternative would be an optional, but I think nan
fits better here.
|
||
namespace cudf::io { | ||
|
||
writer_compression_statistics collect_compression_statistics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not shared_ptr
anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function creates a writer_compression_statistics object
, no need to know how its lifetime might be managed after it's created. If we think of this as a factory, we could return a unique_ptr
, but I don't think it's beneficial here.
* @param single_table_stats Statistics data from the current table | ||
* @param compression_stats Compression statistics from the current table | ||
*/ | ||
void update_statistics(size_type num_rows, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this declaration here? Can we make this a free function?
* | ||
* @param compression_stats Optional compression statistics from the current table | ||
*/ | ||
void update_compression_statistics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question, can we make this a free function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It changes the internal state of the impl
object, I think it fits better as a member function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
@@ -1136,7 +1136,7 @@ __global__ void __launch_bounds__(256) | |||
device_span<device_span<uint8_t const>> inputs, | |||
device_span<device_span<uint8_t>> outputs, | |||
device_span<compression_result> results, | |||
uint8_t* compressed_bfr, | |||
device_span<uint8_t> compressed_bfr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
/merge |
This adds JNI to expose compression statistics for ORC and Parquet writer, which was implemented in #13294. For doing so, `TableWriter` has been changed from an interface into an abstract class. Related derived writer classes are also updated accordingly. Closes #13252. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #13376
Description
Partially address #13252
Adds an option to ORC and Parquet writers to take a shared pointer to a
writer_compression_statistics
objects. This object is then updated with the compression statistics from the written file.In the case of chunked writing, statistics object is updated after each file. If a writing operation fails, statistics are not modified.
One statistics object can be reused in multiple write calls, but it should not be used concurrently in multiple threads.
Checklist